Parameter estimation from aggregate observations: a Wasserstein distance-based sequential Monte Carlo sampler

In this work, we study systems consisting of a group of moving particles. In such systems, often some important parameters are unknown and have to be estimated from observed data. Such parameter estimation problems can often be solved via a Bayesian inference framework. However, in many practical problems, only data at the aggregate level is available and as a result the likelihood function is not available, which poses a challenge for Bayesian methods. In particular, we consider the situation where the distributions of the particles are observed. We propose a Wasserstein distance (WD)-based sequential Monte Carlo sampler to solve the problem: the WD is used to measure the similarity between the observed and the simulated particle distributions and the sequential Monte Carlo samplers is used to deal with the sequentially available observations. Two real-world examples are provided to demonstrate the performance of the proposed method.


Introduction
In transportation science, it is often of critical importance to study the collective behaviour of a group of moving objectives (referred to as particles hereafter). A good example of such problems is the dynamics of a crowd of human pedestrians, which has important applications in urban planning and safety management [1,2]. Other examples of such systems include traffic flows [3], swarm various models have been proposed in the past decades. Despite the modelling advances, some issues are yet to be solved. In particular, certain key information in the system may be missing, and as a result some important model parameters are not known to the practitioners. Examples of such a situation may appear in pedestrian crowds [5] and traffic flows [6,7]. In practice, these model parameters are often estimated by fitting the real-world observation data into the mathematical models, often via a Bayesian framework.
The Bayesian parameter estimation is conceptually straightforward and has been used to estimate model parameters in many related problems [8][9][10][11]. However, it can be highly challenging to apply the Bayesian method to the microscopic models of such systems-i.e. models that directly describe the dynamics of the individual particles in the system. A main difficulty is that, in order to conduct the standard Bayesian inference for a microscopic model, one needs to track each particle, which is extremely difficult or even impossible when the ensemble size is large. Instead, in reality, one often observes the aggregate data that are collected at the ensemble level and do not characterize the state of each individual particle. As will be explained later, determining a suitable distribution for aggregate measures in this case can be challenging, potentially resulting in the unavailability of an analytically derived likelihood function. Additionally, the presence of unknown measurement noise further complicates the tractability of the likelihood. As a consequence, standard posterior computation methods such as Markov Chain Monte Carlo (MCMC) cannot be used. A large class of approximate Bayesian inference techniques have been developed to address such likelihood-free problems, such as the approximate Bayesian computation (ABC) [12] and other methods, e.g. [13]. These techniques have been used to conduct likelihood-free inference from aggregate observations. For instance, macroscopic observation data measuring the flow is used to calibrate a microscopic model for crowd dynamics using ABC methods [10].
In the present work, we consider a special case of aggregate observations-namely, it is the distribution of the particles that is observed. Our main purpose is to develop a generic method that can compute the posterior in such problems, by taking advantage of the rich literature of the likelihood-free inference. As in our problem of interest, usually the data are observed in a sequential manner. We choose to base our method on the sequential Monte Carlo sampler (SMCS) [14], an extension of the particle filter, for its ability to deal with sequentially available data. Some variants of SMCS can deal with likelihood-free inference [15] and these methods typically rely on the distance (e.g. the Euclidean distance) between the simulated data and the observed data. To this end, a similar problem has been considered in the crowd dynamics context in [11], where a crowd flow forecasting method based on particle filter is proposed to estimate both the crowd state and latent parameters from the aggregate density observation data. In [11], the total-variation (TV) distance is used to measure the similarity between the observed and the simulated distributions. However, such distance measures may become ineffective when observation noise is present: in figure 1, we provide an example in which the TV distance between U and V and that between U and W are the same, which makes it unsuited for measuring the distance between the observed and the simulated distributions. To address the issue, we propose to use the Wasserstein distance (WD) [16] as a distance measure between the simulated and the observed distributions. The WD is a commonly used measure for similarity between two distributions, which loosely speaking, is the minimal cost for transforming one distribution into the other. In figure 1, the WD between U and V is computed as 0.03 while the WD between U and W is determined as 0.3. This discrepancy more reasonably characterizes the differences  Figure 1. An illustrative example of the TV distance and the Wasserstein distance where the histograms of three probability measures, U, V and W, are shown. The TV distance between U and V and that between U and W are the same. The WD between U and V is 0.03, while the WD between U and W is 0.3. See the main text for more information.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230275 between these distributions and addresses the aforementioned issue with the TV distance. WD has been widely used in many applications in statistics and machine learning, such as [17,18], as it has many desired properties. Additionally, it has been shown to be effective in mitigating information loss caused by the utilization of summary statistics in basic ABC techniques [19]. In this work we develop a WD-based SMCS method which can deal with the aforementioned problems where the observations are particle distributions (referred to as the aggregate data hereafter).
The rest of the work is organized as follows. We first discuss the likelihood-free sequential inference problem in §2, and present the proposed WD-based SMCS in §3. Numerical examples are provided in §4 to demonstrate the performance of the proposed method. Finally, §5 offers some conclusions.

Problem set-up
We start with a generic set-up of the sequential Bayesian inference problem considered here. Suppose that we have measurements y 1 , …, y T (with y t [ R ny for t = 1, …, T) that are collected sequentially in time from a dynamical system model, and from the data we want to estimate the unknown parameter u [ R nu . In such problems, after collecting t data points y 1 , …, y t , we can compute the posterior distribution of θ conditional on measurements y 1 , …, y t in the following form (assuming all the measurements are independent conditional on θ): where π 0 (θ) is the prior distribution of θ that is usually specified by the users, or equivalently, Our goal is to compute π T (θ), i.e. the final posterior distribution after collecting all data points y 1 , …, y T .

Sequential Monte Carlo sampler
Since the posterior distribution is typically not analytically available, a common practice is to characterize the distribution by drawing samples or realizations from it. In principle, the posterior π T (θ) can be sampled with the standard methods such as MCMC. However, as is pointed out in [20,21], the posterior in equation (2.1) poses challenges for usual MCMC methods especially when T is large, as they need to query the likelihood function π(·|θ) a large number of times. To this end, the SMCS method is particularly designed to exploit the sequential structure of such problems. SMCS is a sequential importance sampling method, and as such it generates weighted samples from the posterior distribution, where the samples represent possible values from the posterior, while the weights assign importance to each sample based on their contribution to capturing characteristics of the distribution. We here describe SMCS in a recursive formulation. Given an arbitrary conditional distribution L t−1 (θ t−1 |θ t ), one can construct the following joint distribution of θ t−1 and θ t : ð2:3Þ whose marginal distribution over θ t−1 is π t (θ t ). Importance sampling [22] is applied to draw samples for p t (θ t−1 , θ t ), where the proposal distribution q t (θ t−1 , θ t ) is constructed in the form of from which we can directly draw samples (θ t−1 , θ t ). Here q t−1 (θ t−1 ) is a marginal distribution and K t (θ t | θ t−1 ) a conditional distribution. The weights of samples are computed according to royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230275 is the weight at t − 1 step, and is the incremental weight at the t-th step. Now that we have joint samples fðu j tÀ1 , u j t Þ, v j t g N j¼1 from the joint distribution p t (θ t−1 , θ t ), it follows that samples fu j t , v j t g N j¼1 correspond to the marginal distribution π t (θ t ). The complete SMCS algorithm proceeds as follows: (i) let t = 0, draw samples fu j 0 g N j¼1 from q 0 ( · ), and compute v Finally the procedure returns a set of weighted samples fu j T , v j T g N j¼1 from the posterior π T (θ). Note that in SMCS algorithms, a resampling step is commonly used to alleviate the 'sample degeneracy' issue [14]. When the effective sample size (ESS) falls below a specific threshold (usually less than half the total number of samples), resampling is performed on the proposed samples to mitigate this issue.
From the discussions above, we can see that in SMCS the forward kernel K t (θ t |θ t−1 ) and the backward kernel L t−1 (θ t−1 |θ t ) are critical for the performance of the method and therefore have to be chosen carefully. Here, we adopt the MCMC moves developed in [14]. The sampling step (iii) in the SMCS algorithm with this forward kernel is constructed as follows. First, we choose a proposal distribution k(θ t |θ t−1 ) and draw a sample u Ã t from it with a sample u j tÀ1 from the previous iteration. Next the sample u Ã t is accepted or rejected according to the following acceptance probability: With this forward kernel, an approximate optimal backward kernel can be derived as where the detailed derivation can be found in [14] and is not repeated here. With the choice of this approximate backward kernel, the incremental weight function α t (θ t−1 , θ t ) in equation (2.5c) reduces to and when applied to the posterior distribution in equation (2.2), it becomes, Finally, we provide some remarks on the SMCS method: -First we reinstate that SMCS is a special type of importance sampling (IS) approach, as it draws samples from a proposal distribution instead of the actual posterior, while correcting for the bias by assigning proper weights to each samples. Therefore, just like the standard IS, the method produces a set of sample-weight pairs which follow the posterior distribution. We refer to [22] for more details of IS. The main reason to use such a method is that it is challenging to draw samples directly from the posterior. SMCS is special as it constructs the joint target distribution as well as the joint proposal distribution in a special way, using the forward and backward kernel functions. -The functions K t (θ t |θ t−1 ) and L t−1 (θ t−1 |θ t ) are both conditional distributions used within the SMCS method. They are not part of the inference problem and should not be regarded as likelihood functions. Instead, they can be understood as some auxiliary functions required by the SMCS method. More precisely they are used in constructing joint distributions along with a marginal distribution, and the terms 'forward' and 'backward' are associated with the dependence of their arguments.

Likehood-free SMCS
It is easy to see that, to apply SMCS to the sequential inference problem described in §2.1, one needs the knowledge of the likelihood function π(·|θ). In many real-world applications, including the problems of our interest in this paper, often the likelihood function is not explicitly available, and instead there exists a simulation model that can generate synthetic observation data that distribute according to the likelihood function π(·|θ), given a parameter value θ. In this case the approximate (or likelihood-free) inference methods can be used. While noting that many such approximate inference methods have been proposed, we focus on the one based on SMCS, which was modified from [23].
Recall that, in the SMCS algorithm, the likelihood function is evaluated on two occasions: the first is to calculate the acceptance probability in equation (2.6) and the second is to update the sample weights in equation (2.9). The main idea of the approximate method is rather simple: one first generates a synthetic observation dataŷ from the simulation model and approximates the value of the likelihood function with a kernel function H( · , · ) that characterizes the similarity betweenŷ and the actual data y: pðyjuÞ % Hðŷ, yÞ, ð2:10Þ whereŷ is the synthetic observation data generated from pðŷjuÞ. We reinstate here that the kernel function H is chosen to characterize the similarity betweenŷ and y, and as such, the more similarŷ and y are, the higher the likelihood function value is.

Aggregate observations and the Wasserstein distance
The application problems of our interest are a special case of the sequential inference problems described in §2. Specifically, we consider the dynamics of an ensemble of n particles (such as pedestrians or vehicles), which is described by a discrete-time dynamical system where x t = (x 1t , …, x nt ) with x it (for i = 1, …, n) being the position of the i-th particle at time t, and θ is the unknown model parameter that we want to estimate. In the standard set-up of such problems, one is able to observe the particle position x t at each time t, from which the parameter θ is inferred, and in this case, the likelihood function is usually available. However, in many real-world problems, especially when the number of particles is large, it is not possible to track and locate each particle, and instead it is often much easier to observe how the particles are distributed at a given time t, often referred to as aggregate observations. Namely, at each t = 1, …, T, we observe the distribution of x t , denoted as ρ t (x), and we aim to infer parameter θ from ρ t (x). It should be clear that, since the relation between the parameter and the aggregate observations are complex, it is usually challenging to derive the likelihood function analytically. As such likelihood-free methods need to be used for such problems. We adopt the likelihood-free SMCS method described in §2.3 to solve the problem, and recall that, in this framework, the key issue is to choose the kernel function H( · , · ) that measures the similarity between two data points that are essentially two distributions in the present problem. To this end, we choose to define the kernel function H via the WD of two distributions. The WD, also known as the earth mover's distance (EMD), is a popular choice for measuring dissimilarity or distance between two distributions, and widely used in real-world applications. Intuitively speaking the WD is calculated as the minimum 'cost' of transforming one distribution into the other. In what follows we discuss how to calculate the WD between two discrete-valued distributions. Let r U ¼ fðu 1 . . , ðv n , p v n Þg be two discrete-valued distributions, where p u i represents the density at the data point u i . A function c(u i , v j ) is further defined as the cost for transporting a unit from u i ∈ U to v j ∈ V, which is taken to be the Euclidean distance between u i and v j in this work. To calculate the WD, we first need to compute a 'transport map' γ(u i , v j ) that minimize the overall cost royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230275 subject to the following constraints: Once the optimal transport map γ is found, the WD is defined as For more details of the WD, we refer to [16] and the references therein. It follows that the kernel function can be defined as where h is the bandwidth parameter of the kernel. It is of essential importance to choose an appropriate value for h as it directly affects the sensitivity of the kernel to the WD value. The proper choice of h is usually problem dependent, and more importantly, for a particular problem, as the WDs of samples can vary across iterations, a fixed value of h may not be suitable for all the iterations. As a result, we adopt an adaptive approach where h is dynamically determined for each iteration. Namely, we set h to be the median WD among the samples. This adaptive strategy allows us to adjust the value of h according to varying WDs encountered during the estimation process, ensuring an appropriate level of sensitivity for the kernel. Finally we present the complete algorithm of the WD-based SMCS in algorithm 1.

Numerical experiments
In this section, we consider two application problems, with which we conduct numerical experiments to examine the performance of the proposed method. The first example is the classic social force model (SFM) for the crowd dynamics of pedestrians. The second is the intelligent driver model (IDM) for the collective behaviour of traffic flow. We discuss the model details and the setup of the simulation scenarios in the following sections.

The social force model
Our first example is the dynamics of pedestrian crowds, and in particular we consider a typical scenario of collective escape towards a single exit (see figure 2, left). We adopt the SFM [24] to describe the crowd behaviour. Simply speaking SFM assumes a mixture of physical and socio-psychological forces influencing the crowd behaviour by considering personal motivations and environmental constraints. In this model, each pedestrian i of mass m i and velocity v i tends to move by a desired speed v p i along a certain direction e p i during the acceleration time τ i . The resulting personal desire force F p is Furthermore, pedestrians psychologically tend to keep a social distance between each other and avoid hitting walls. This is modelled by introducing 'interaction force' f ij between pedestrians i and j and f iW between pedestrian i and the wall, respectively. The total interaction force is Combining equations (4.1) and (4.2), we obtain the acceleration equation The position vector r i (t) is updated by the velocity v i (t) = dr i /dt.
The interaction force f ij between pedestrian i and j is specified as follows. With the distance d ij = ‖r i − r j ‖ between the two pedestrians' centres of mass, the psychological tendency of pedestrian i to stay away from pedestrian j is described by a repulsive interaction force A i exp[(r ij − d ij )/B i ]n ij , where A i and B i are constants, indicating the strength and the range of the interaction, and n ij ¼ ðn 1 ij , n 2 ij Þ ¼ ðr i À r j Þ=d ij is the normalized directional vector pointing from pedestrian j to i. The pedestrians touch each other if their distance d ij is smaller than the sum r ij = r i + r j of their radius r i and r j . In our model, we specify a uniform value for the size of each pedestrian (table 1) [24]. Inspired by granular interactions, two additional forces are included in the model, which are essential for understanding the particular effects in panicking crowds: a 'body force' k(r ij − d ij )n ij counteracting body compression and a 'sliding friction force' kðr ij À d ij ÞDv t ji t ij impeding relative tangential motion, if pedestrians i and j are close enough. Here, t ij ¼ ðÀn 2 ij , n 1 ij Þ means the tangential direction and Dv t ji ¼ ðv j À v i Þ Á t ij the tangential velocity difference, while k and κ are large constants, representing the bump and the friction effect. In summary, the interaction force f ij between pedestrians i and j is given by where the indicator function I(r ij − d ij ) is zero for r ij − d ij < 0 and it is equal to r ij − d ij otherwise. The interaction with the walls is treated analogously. By denoting d iW as the distance to wall W, n iW as the direction perpendicular to it, and t iW as the direction tangential to it, we have ð4:5Þ With the SFM described above, we simulate the collective escape of pedestrians towards a single exit in a room. Initially, a total of 100 pedestrians are randomly distributed within a square room measuring 10 m on each side. The room has an exit with a width of 2 m. The parameter values used to simulate the data are given in table 1, largely following [24]. We note here that in most of the real-world problems, the individuals are subject to different parameter values. To reduce the computational cost, we simplify the problem by assuming that the values for m i , v p i , τ i , r i , A i and B i are identical for all individuals, as are in [10,11]. Among these parameters we assume that v p , A and B are unknown and need to be inferred from the observation data. Namely, suppose we can observe the positions of pedestrians x t at different times t obs = t × Δ t for t = 1, …, T, and we aim to estimate the aforementioned parameters in the SFM with these sequential observations. The observation noise is assumed to be a zero-mean Guassian with standard deviation σ. At every observation time, noise is added to the actual positions of each particle and the underlying distribution of this noisy observation is estimated by its empirical densities. We emphasize that we are able to observe the particle locations, but we are unable to track each particle. In this experiment, we take Δ t = 0.1 s and T = 30 and the simulation time step in SFM is dt = 0.001 s. The prior distribution for u ¼ ðA, B, v p Þ ⊺ is taken to be uniform: A ∼ U[1200, 2200], B ∼ U[0.02, 0.2] and v p ∼ U[0, 1.5]. It is important to note that the SFM can encounter issues with unphysical parameter values that result in body compression and abrupt changes in velocities. Thus appropriate prior distributions of parameters are necessary to prevent such unrealistic simulations. Additionally, the simulation program includes an inspection procedure to identify and replace any unreasonable samples that may produce unrealistic signals during the simulation process. This approach enhances the stability of the simulation process and allows for more flexibility in setting parameter priors. We simulate the observation data with two different noise levels: σ = 0.04 and σ = 0.1, and use 500 samples in WD-SMCS to estimate parameters at each observation time. For each sample, the synthetic dataŷ t ¼ r t ðxÞ is the approximated density of particle locations x t at observation time t. These particle locations are generated by the SFM simulation with the parameters assigned the values of the corresponding samples. Importantly, in the simulation step, no noise is added to the particle locations due to our assumption that we have no prior knowledge about the measurement noise.
We first consider the small noise case where σ = 0.04. In figure 3, we plot the posterior means and the standard deviations against the number of iterations for all the three parameters. As can be seen from the figure, the mean values of all three parameters converge within 25 sequential observations, and the converged values are rather close to the ground truth, indicating good inference accuracy in the case when the observation noise is small. Recall that, a key element in the proposed method is that the simulated distribution should be close to the observed one. To facilitate the comparison of crowd distributions, we approximate the density map of particle locations using the kernel density estimation (KDE) [25] method. In figure 4, we compare the observed particle density and that simulated with the posterior means (referred to as the posterior-simulated density) at three time steps: t = 1, 11, 25; we also show the densities from the prior means (referred to as the prior-simulated density) and the ground truth as a reference. For the two simulated densities, we also calculate the WD between them to the observed one. It can be seen that as the iteration proceeds (and therefore more data is available) the density associated with the prior means deviates from the observation, the posterior prediction density is closer to the observation with a smaller WD, compared to the prior prediction density. Next , the prior prediction (third row) and the posterior prediction (last row) in terms of crowd density map at specific time points. The prior prediction is based on the mean values of parameter priors, while the posterior prediction is generated with the posterior means. WD0 is the WD between the ground truth and the observation. WD1 is the WD between the observed density and the prior-simulated one, and WD2 is that between the observed density and the posterior-generated one.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230275 we consider the case of larger noise, i.e. σ = 0.1. As before we first plot the posterior means against the number of iterations in figure 5. One can see that, while the posterior means also approach to the true value as the iteration proceeds, more iterations (and hence more data) are required in order to accurately infer the parameters (especially A), due to the larger observation noise. We then show the same particle density plots as those in figure 4. One can see here that figure 6 is qualitatively similar to figure 4, which once again demonstrates that the proposed method can successfully obtain parameters values that can drive the particle density towards the observed one.  Figure 6. (σ = 0.1) Comparison among the ground truth (first row), the observation (second row), the prior prediction (third row) and the posterior prediction (last row) in terms of crowd density map at specific time points. The prior prediction is based on the mean values of parameter priors, while the posterior prediction is generated with the posterior means. WD0 is the WD between the ground truth and the observation. WD1 is the WD between the observed density and the prior-simulated one, and WD2 is that between the observed density and the posterior-generated one.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230275 We conduct further testing of the proposed algorithm using real-world data. Specifically, the experimental data [26] are collected in a corridor scenario leading straight to an entrance gate, which is common for concerts or other events. Initially, there are 75 participants waiting in the corridor, which has a width of 5.6 m. The exit is located along the negative y-axis and has a width of 0.5 m. Once instructed, participants start moving towards the entrance gate, and their locations are recorded during the process. More detailed information about the experiment can be found in [26]. We assume that the crowd dynamics in the corridor follow the SFM, and we use the observed distributions to infer the model's parameters. For this experiment, we take Δ t = 0.2 s and T = 40. The parameters to be inferred and their priors are the same as before. In figure 7, we plot the posterior means and the standard deviations against the number of iterations for all the three parameters. From the figure, it can be observed that the mean value of the parameter v p converges within 30 sequential observations, while the mean values of the other two parameters, A and B, remain highly variable. Given the unavailability of the actual model and parameter values, a reasonable practice is to assess whether the estimated model can capture the observed data patterns. Figure 8 shows the comparison between densities. One can see that both the prior-generated and the posterior-generated densities present significant discrepancies compared to the observed density. However, the posterior density demonstrates a smaller WD to the observed density compared to the prior density. We note that the performance is ultimately limited by SFM, which may not accurately describe this particular scenario. Furthermore, the simplification of assuming identical parameters for each participant also affects the performance. Nevertheless, we can still extract some useful information from the data using the proposed method.

The intelligent driver model
Our second example is the traffic on a four-lane highway (see figure 2, right), described by the IDM [27]. IDM is a widely adopted car-following model in microscopic traffic simulation. In this type of models, as its name suggests, the i-th vehicle follows the (i − 1)-th vehicle in front of it. The IDM assumes the acceleration of a follower i is a continuous function of a set of inputs-the follower's current velocity v i ; the gap distance s i to the leader; and the relative velocity (approach rate) Δv i with respect to the preceding vehicle. The dynamics equation is defined as ð4:6Þ in which royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230275 acceleration/deceleration. In this example, we study the case of identical vehicles whose model parameters v With the IDM described above, we simulate the traffic flow on a four-lane highway. Each lane spans 300 m and starts empty. Vehicles arrive at a rate of three vehicles per second, which is intentionally chosen to ensure a sufficiently busy road, where interactions between vehicles occur frequently. Suppose that we are able to observe the vehicle locations x t at a sequence of discrete-time points: t obs = t × Δ t for t = 1, …, T, and we want to estimate the parameters u ¼ ðv 0 , a, T s Þ ⊺ in the IDM with these observed data with all other parameters known. The measurement noise is assumed to be a zero-mean Guassian with standard deviation σ. This noise is added to the actual positions of each vehicle at every observation time. In this experiment, we take Δ t = 1.  Figure 8. Comparison among the observation (first row), the prior prediction (second row) and the posterior prediction (last row) in terms of crowd density map at specific time points. The prior prediction is based on the mean values of parameter priors, while the posterior prediction is generated with the posterior means. WD1 is the WD between the observed density and the prior-simulated one, and WD2 is that between the observed density and the posterior-generated one. and σ = 1.0, and use 500 samples in WD-SMCS to estimate parameters at each observation time. For each sample, the synthetic dataŷ t ¼ r t ðxÞ is the approximated density of vehicle locations x t at observation time t. These vehicle locations are generated by the IDM simulation with the parameters assigned the values of the corresponding samples. We first consider the small noise case with σ = 0.1 and plot the posterior means and standard deviations of the parameters against the number of iterations in figure 9. As can be seen from the figure, the mean estimated values of all three parameters converge to almost the ground truth within 15 sequential observations, which indicates the accurate inference in the case when the observation noise is small. The plots for the large noise case (i.e. σ = 1.0) are shown in figure 10, where we can see that it requires more data points to infer the parameters (especially a and T s ), but rather accurate results can still be obtained within 25 sequential observations. We also provide comparison of the vehicle distributions in figures 11 and 12 for the small and the large noise, respectively. In this example, comparing the distribution of vehicles based on their locations is more straightforward and visually interpretable due to the clear separation by lanes. Additionally, to accommodate the space limitations in the paper, we plot a subset of vehicle locations over a specific interval on the highway, covering approximately 100 m. Specifically, the figures show the observed vehicle distributions (blue), the posterior-simulated (green) and the prior-simulated (red) vehicle distributions. Visually in both figures the posterior-simulated distribution is rather close to the observation, while the prior-simulated one clearly deviates. Quantitatively the WD between the posterior-simulated distribution and the   Figure 11. (σ = 0.1) The observed vehicle distribution (blue) compared against the ground truth (black), the posterior-simulated (green) and the prior-simulated (red) distributions over a specific interval on the highway. WD0 is the WD between the observed distribution and the ground truth, WD1 is the WD between the observed distribution and the prior-simulated one, and WD2 is that between the observed distribution and the posterior-simulated one.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 10: 230275 13 observed one is much smaller than that between the prior-simulated and the observed ones: they are 19.8 versus 1.24 in the small noise case and 30.2 versus 4.47 in the large noise case.

Conclusion
In summary, we consider in this work the problem of estimating parameters in a many-particle system. In particular, we aim to address the issue that in such problems it is often possible to observe the distributions of the particles rather than the trace of each individual particle. In this case, the likelihood function is not available and we adopt a likelihood-free SMCS method where the similarity (distance) between the observed data and the simulated data need to be characterized. Since the observed data is actually a distribution, we propose to use the WD to measure the similarity between the observed and the simulated data. Numerical experiments are also provided to demonstrate the performance of the proposed method. Our belief is that the proposed method has potential applications in a variety of realworld problems that involve the dynamics of multiple particles, particularly in the field of transport science. We intend to explore these potential applications in future research. The observed vehicle distribution (blue) compared against the ground truth (black), the posterior-simulated (green) and the prior-simulated (red) distributions over a specific interval on the highway. WD0 is the WD between the observed distribution and the ground truth, WD1 is the WD between the observed distribution and the prior-simulated one, and WD2 is that between the observed distribution and the posterior-simulated one.